HW 1
Question 1
Explain the following terms:
- Temporal locality: Repeated reference to a data item in a small time window.
- Very Long Instruction Word (VLIW) Processor: It relies on the compiler to resolve dependencies and resource availability at compile time. Instructions that can be executed concurrently are packed into groups and parceled off to the processor as a single long instruction word (thus the name) to be executed on multiple functional units at the same time.
- Single instruction multiple data (SIMD): A single control unit dispatches instructions to each processing unit.
- Cache hit ratio: The fraction of data references satisfied by the cache.
- Memory bound computation: Computations that are mainly hindered by the rate at which data can be pumped into the CPU.
- Cache line: A contiguous block of data returned from a single memory request.
- Instruction-level parallelism: the parallel or simultaneous execution of a sequence of instructions in a computer program.
- Data dependencies: The results of an instruction may be required for subsequent instructions.
- Superscalar processor: The ability of a processor to issue multiple instructions in the same cycle.
- Column major layout: Consecutive elements of a column reside next to each other in a memory system.
Question 2
- Fetch 1 word of vector 1 in 1 memory cycle 200 ns
- Fetch 1 word of vector 2 in 1 memory cycle 200 ns
- 1 pair of words 1 multiply-adds 2 FLOPs
- 2 FLOPs in 401ns 5 MFLOPS
- It's a memory bound process peak performance of the processor: 8 FLOPs > 2 FLOPs (sustained performance)
- Fetch 16 words of vector 1 in 1 memory cycle 200 ns
- Fetch 16 words of vector 2 in 1 memory cycle 200 ns
- 16 pair of words 16 multiply-adds 32 FLOPs (32 / 8 = 4 comp cycles)
- 32 FLOPs in 404ns 79.2 MFLOPS
- It's a compute bound process peak performance of the processor: 8 FLOPs < 32 FLOPs (peak memory performance)
Question 3
Assume each word fetched from the cache takes 1 memory cycle.
- Fetch 4 words of matrix A in 1 memory cycle 100 ns
- Fetch 4 words of the vector in from cache 1 memory cycle 4 ns
- 4 pair of words 4 multiply-adds 8 FLOPs (8 / 4 = 2 comp cycles)
- 8 FLOPs in 106ns 75 MFLOPS
Question 4
If can fetch consecutively:
- Fetch 8 words of matrix A in 1 memory cycle 200 ns
- Fetch 8 words of matrix B in 1 memory cycle 1600 ns
- 8 pair of words 8 multiply-adds 16 FLOPs (16 / 8 = 2 comp cycles)
- 16 FLOPs in 1804ns 8.9 MFLOPS
If can fetch any position:
- Fetch 8 words of matrix A in 1 memory cycle 200 ns
- Fetch 8 words of matrix B in 1 memory cycle 200 ns
- 8 pair of words 8 multiply-adds 16 FLOPs (16 / 8 = 2 comp cycles)
- 16 FLOPs in 402ns 40 MFLOPS
Question 5
Assume the cache reads one word per memory cycle. And each word read from DRAM takes 200ns.
The total execution time will be ns ns ns, where is the number of words read from DRAM, is the number of words read from the cache.
Question 6
- a[0] miss, load a[0] and a[1], a[1] hit
- a[1] hit, a[2] miss, load a[2] and a[3]
- a[2] hit, a[3] hit
- a[3] hit, a[4] miss, load a[4] and a[5]
- a[4] hit, a[5] hit
- a[5] hit, a[6] miss, load a[6] and a[7]
- a[6] hit, a[7] hit
Besides the first read operation, the two consecutive operations after having 3 hits and 1 miss. We can approximate the hit ratio as 75%.